33 research outputs found

    Efficient Extraction and Query Benchmarking of Wikipedia Data

    Get PDF
    Knowledge bases are playing an increasingly important role for integrating information between systems and over the Web. Today, most knowledge bases cover only specific domains, they are created by relatively small groups of knowledge engineers, and it is very cost intensive to keep them up-to-date as domains change. In parallel, Wikipedia has grown into one of the central knowledge sources of mankind and is maintained by thousands of contributors. The DBpedia (http://dbpedia.org) project makes use of this large collaboratively edited knowledge source by extracting structured content from it, interlinking it with other knowledge bases, and making the result publicly available. DBpedia had and has a great effect on the Web of Data and became a crystallization point for it. Furthermore, many companies and researchers use DBpedia and its public services to improve their applications and research approaches. However, the DBpedia release process is heavy-weight and the releases are sometimes based on several months old data. Hence, a strategy to keep DBpedia always in synchronization with Wikipedia is highly required. In this thesis we propose the DBpedia Live framework, which reads a continuous stream of updated Wikipedia articles, and processes it. DBpedia Live processes that stream on-the-fly to obtain RDF data and updates the DBpedia knowledge base with the newly extracted data. DBpedia Live also publishes the newly added/deleted facts in files, in order to enable synchronization between our DBpedia endpoint and other DBpedia mirrors. Moreover, the new DBpedia Live framework incorporates several significant features, e.g. abstract extraction, ontology changes, and changesets publication. Basically, knowledge bases, including DBpedia, are stored in triplestores in order to facilitate accessing and querying their respective data. Furthermore, the triplestores constitute the backbone of increasingly many Data Web applications. It is thus evident that the performance of those stores is mission critical for individual projects as well as for data integration on the Data Web in general. Consequently, it is of central importance during the implementation of any of these applications to have a clear picture of the weaknesses and strengths of current triplestore implementations. We introduce a generic SPARQL benchmark creation procedure, which we apply to the DBpedia knowledge base. Previous approaches often compared relational and triplestores and, thus, settled on measuring performance against a relational database which had been converted to RDF by using SQL-like queries. In contrast to those approaches, our benchmark is based on queries that were actually issued by humans and applications against existing RDF data not resembling a relational schema. Our generic procedure for benchmark creation is based on query-log mining, clustering and SPARQL feature analysis. We argue that a pure SPARQL benchmark is more useful to compare existing triplestores and provide results for the popular triplestore implementations Virtuoso, Sesame, Apache Jena-TDB, and BigOWLIM. The subsequent comparison of our results with other benchmark results indicates that the performance of triplestores is by far less homogeneous than suggested by previous benchmarks. Further, one of the crucial tasks when creating and maintaining knowledge bases is validating their facts and maintaining the quality of their inherent data. This task include several subtasks, and in thesis we address two of those major subtasks, specifically fact validation and provenance, and data quality The subtask fact validation and provenance aim at providing sources for these facts in order to ensure correctness and traceability of the provided knowledge This subtask is often addressed by human curators in a three-step process: issuing appropriate keyword queries for the statement to check using standard search engines, retrieving potentially relevant documents and screening those documents for relevant content. The drawbacks of this process are manifold. Most importantly, it is very time-consuming as the experts have to carry out several search processes and must often read several documents. We present DeFacto (Deep Fact Validation), which is an algorithm for validating facts by finding trustworthy sources for it on the Web. DeFacto aims to provide an effective way of validating facts by supplying the user with relevant excerpts of webpages as well as useful additional information including a score for the confidence DeFacto has in the correctness of the input fact. On the other hand the subtask of data quality maintenance aims at evaluating and continuously improving the quality of data of the knowledge bases. We present a methodology for assessing the quality of knowledge bases’ data, which comprises of a manual and a semi-automatic process. The first phase includes the detection of common quality problems and their representation in a quality problem taxonomy. In the manual process, the second phase comprises of the evaluation of a large number of individual resources, according to the quality problem taxonomy via crowdsourcing. This process is accompanied by a tool wherein a user assesses an individual resource and evaluates each fact for correctness. The semi-automatic process involves the generation and verification of schema axioms. We report the results obtained by applying this methodology to DBpedia

    Using Semantic Web Technologies to Query and Manage Information within Federated Cyber-Infrastructures

    Get PDF
    A standardized descriptive ontology supports efficient querying and manipulation of data from heterogeneous sources across boundaries of distributed infrastructures, particularly in federated environments. In this article, we present the Open-Multinet (OMN) set of ontologies, which were designed specifically for this purpose as well as to support management of life-cycles of infrastructure resources. We present their initial application in Future Internet testbeds, their use for representing and requesting available resources, and our experimental performance evaluation of the ontologies in terms of querying and translation times. Our results highlight the value and applicability of Semantic Web technologies in managing resources of federated cyber-infrastructures.EC/FP7/318389/EU/Federation for FIRE/Fed4FIREEC/FP7/732638/EU/Federation for FIRE Plus/Fed4FIREplu

    A Learning Based Framework for Improving Querying on Web Interfaces of Curated Knowledge Bases

    Get PDF
    Knowledge Bases (KBs) are widely used as one of the fundamental components in Semantic Web applications as they provide facts and relationships that can be automatically understood by machines. Curated knowledge bases usually use Resource Description Framework (RDF) as the data representation model. To query the RDF-presented knowledge in curated KBs, Web interfaces are built via SPARQL Endpoints. Currently, querying SPARQL Endpoints has problems like network instability and latency, which affect the query efficiency. To address these issues, we propose a client-side caching framework, SPARQL Endpoint Caching Framework (SECF), aiming at accelerating the overall querying speed over SPARQL Endpoints. SECF identifies the potential issued queries by leveraging the querying patterns learned from clients’ historical queries and prefecthes/caches these queries. In particular, we develop a distance function based on graph edit distance to measure the similarity of SPARQL queries. We propose a feature modelling method to transform SPARQL queries to vector representation that are fed into machine-learning algorithms. A time-aware smoothing-based method, Modified Simple Exponential Smoothing (MSES), is developed for cache replacement. Extensive experiments performed on real-world queries showcase the effectiveness of our approach, which outperforms the state-of-the-art work in terms of the overall querying speed

    Efficient Extraction and Query Benchmarking of Wikipedia Data

    Get PDF
    Knowledge bases are playing an increasingly important role for integrating information between systems and over the Web. Today, most knowledge bases cover only specific domains, they are created by relatively small groups of knowledge engineers, and it is very cost intensive to keep them up-to-date as domains change. In parallel, Wikipedia has grown into one of the central knowledge sources of mankind and is maintained by thousands of contributors. The DBpedia (http://dbpedia.org) project makes use of this large collaboratively edited knowledge source by extracting structured content from it, interlinking it with other knowledge bases, and making the result publicly available. DBpedia had and has a great effect on the Web of Data and became a crystallization point for it. Furthermore, many companies and researchers use DBpedia and its public services to improve their applications and research approaches. However, the DBpedia release process is heavy-weight and the releases are sometimes based on several months old data. Hence, a strategy to keep DBpedia always in synchronization with Wikipedia is highly required. In this thesis we propose the DBpedia Live framework, which reads a continuous stream of updated Wikipedia articles, and processes it. DBpedia Live processes that stream on-the-fly to obtain RDF data and updates the DBpedia knowledge base with the newly extracted data. DBpedia Live also publishes the newly added/deleted facts in files, in order to enable synchronization between our DBpedia endpoint and other DBpedia mirrors. Moreover, the new DBpedia Live framework incorporates several significant features, e.g. abstract extraction, ontology changes, and changesets publication. Basically, knowledge bases, including DBpedia, are stored in triplestores in order to facilitate accessing and querying their respective data. Furthermore, the triplestores constitute the backbone of increasingly many Data Web applications. It is thus evident that the performance of those stores is mission critical for individual projects as well as for data integration on the Data Web in general. Consequently, it is of central importance during the implementation of any of these applications to have a clear picture of the weaknesses and strengths of current triplestore implementations. We introduce a generic SPARQL benchmark creation procedure, which we apply to the DBpedia knowledge base. Previous approaches often compared relational and triplestores and, thus, settled on measuring performance against a relational database which had been converted to RDF by using SQL-like queries. In contrast to those approaches, our benchmark is based on queries that were actually issued by humans and applications against existing RDF data not resembling a relational schema. Our generic procedure for benchmark creation is based on query-log mining, clustering and SPARQL feature analysis. We argue that a pure SPARQL benchmark is more useful to compare existing triplestores and provide results for the popular triplestore implementations Virtuoso, Sesame, Apache Jena-TDB, and BigOWLIM. The subsequent comparison of our results with other benchmark results indicates that the performance of triplestores is by far less homogeneous than suggested by previous benchmarks. Further, one of the crucial tasks when creating and maintaining knowledge bases is validating their facts and maintaining the quality of their inherent data. This task include several subtasks, and in thesis we address two of those major subtasks, specifically fact validation and provenance, and data quality The subtask fact validation and provenance aim at providing sources for these facts in order to ensure correctness and traceability of the provided knowledge This subtask is often addressed by human curators in a three-step process: issuing appropriate keyword queries for the statement to check using standard search engines, retrieving potentially relevant documents and screening those documents for relevant content. The drawbacks of this process are manifold. Most importantly, it is very time-consuming as the experts have to carry out several search processes and must often read several documents. We present DeFacto (Deep Fact Validation), which is an algorithm for validating facts by finding trustworthy sources for it on the Web. DeFacto aims to provide an effective way of validating facts by supplying the user with relevant excerpts of webpages as well as useful additional information including a score for the confidence DeFacto has in the correctness of the input fact. On the other hand the subtask of data quality maintenance aims at evaluating and continuously improving the quality of data of the knowledge bases. We present a methodology for assessing the quality of knowledge bases’ data, which comprises of a manual and a semi-automatic process. The first phase includes the detection of common quality problems and their representation in a quality problem taxonomy. In the manual process, the second phase comprises of the evaluation of a large number of individual resources, according to the quality problem taxonomy via crowdsourcing. This process is accompanied by a tool wherein a user assesses an individual resource and evaluates each fact for correctness. The semi-automatic process involves the generation and verification of schema axioms. We report the results obtained by applying this methodology to DBpedia

    Effect of Some Calcium Channel Blockers in Experimentally Induced Diabetic Nephropathy in Rats

    No full text
    Diabetic nephropathy (DNP) is considered a CRD (Chronic Renal Disease); it is a major cause of illness and premature death in people with DM. The present study was designed to illustrate the role of CCBs (amlodipine and diltiazem) in prevention and treatment of DNP in rats. Eighty male albino rats weighing (130-180gm) were used in this study. These animals were subdivided into five equal groups. Insulinopenic diabetes was induced by STZ, two weeks later, 30 minutes of complete ischemia was induced in the left kidney to induce diabetic nephropathy then treatment was started for 12 weeks. At the end of experiment urine samples and blood samples were taken for biochemical analysis and kidneys were taken for histopathological evaluation. Combination of renal ischemia with DM produced a significant increase in rat weight, rat kidney weight, BUN (Blood Urea Nitrogen) level, K/B (Kidney/Body weight) ratio, random blood glucose, 24 hrs urine proteins, and 24 hrs urine volumes and creatinine clearance. Treatment with diltiazem or amlodipine significantly lowered elevated SBP and elevated 24 hrs urine volumes. Furthermore, treatment with captopril produced a highly significant lowering of elevated SBP and elevated serum creatinine; and a significant reduction in elevated K/B ratio and proteinuria. Light microscopic examination of diabetic kidneys revealed glomerulopathy characterized by thickening of the glomerular basement membrane, mesangial matrix expansion, arteriolar hyalinosis and large proteinaceous deposits occluding some capillary loops and hyaline droplets within the glomeruli. Moreover, examination of kidneys of ischemic animals by light microscope revealed focal tubular necrosis at multiple points along the nephron, interstitial edema and accumulation of leucocytes within dilated vasa recta

    Defacto - deep fact validation

    No full text
    Abstract. One of the main tasks when creating and maintaining knowledge bases is to validate facts and provide sources for them in order to ensure correctness and traceability of the provided knowledge. So far, this task is often addressed by human curators in a three-step process: issuing appropriate keyword queries for the statement to check using standard search engines, retrieving potentially relevant documents and screening those documents for relevant content. The drawbacks of this process are manifold. Most importantly, it is very time-consuming as the experts have to carry out several search processes and must often read several documents. In this article, we present DeFacto (Deep Fact Validation) -an algorithm for validating facts by finding trustworthy sources for it on the Web. DeFacto aims to provide an effective way of validating facts by supplying the user with relevant excerpts of webpages as well as useful additional information including a score for the confidence DeFacto has in the correctness of the input fact

    User-driven Quality Evaluation of DBpedia

    No full text
    Linked Open Data (LOD) comprises of an unprecedented volume of structured datasets on the Web. However, these datasets are of varying quality ranging from extensively curated datasets to crowdsourced and even extracted data of relatively low quality. We present a methodology for assessing the quality of linked data resources, which comprises of a manual and a semi-automatic process. The first phase includes the detection of common quality problems and their representation in a quality problem taxonomy. In the manual process, the second phase comprises of the evaluation of a large number of individual resources, according to the quality problem taxonomy via crowdsourcing. This process is accompanied by a tool wherein a user assesses an individual resource and evaluates each fact for correctness. The semi-automatic process involves the generation and verification of schema axioms. We report the results obtained by applying this methodology to DBpedia. We identified 17 data quality problem types and 58 users assessed a total of 521 resources. Overall, 11.93% of the evaluated DBpedia triples were identified to have some quality issues. Applying the semi-automatic component yielded a total of 222,982 triples that have a high probability to be incorrect. In particular, we found that problems such as object values being incorrectly extracted, irrelevant extraction of information and broken links were the most recurring quality problems. With this study, we not only aim to assess the quality of this sample of DBpedia resources but also adopt an agile methodology to improve the quality in future versions by regularly providing feedback to the DBpedia maintainers

    Late onset seroma post-thymectomy presenting as cardiac tamponade

    No full text
    Late onset seroma is a rare post-operative complication occurring after various surgeries including thymectomy. Most cases are asymptomatic; however, seromas occurring in the mediastinal cavity may cause compression symptoms including airway compression or cardiac tamponade. We present a 62-year-old male with a history of thymectomy for myasthenia gravis who presented with cardiac tamponade several years ago. Further evaluation revealed a late onset seroma anteriorly compressing the cardiac chambers resulting in tamponade physiology
    corecore